The data set is composed of a balanced number of positive and negative movie reviews. It comes as 3 different CSV files.
The goal is to predict the predicted_label column. The prediction quality is measured by the precision metric.
Results should be a txt file or csv file with 1 column: the predicted_class {0,1}, as shown in this toolkit. You have to keep the original order of the datasets.
The first thing to do is to download all the data from the website: https://competitions.codalab.org/competitions/8131#learn_the_details-description
Then rename the files to their original names (remove the key 'datasets_None_0b3a301a-be2e-4f21-8be9-dfa5c56439c4')
and place them in a 'data/' folder.
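The renaming step can be scripted instead of done by hand. The sketch below is only an illustration: `original_name` and `restore_names` are hypothetical helpers, and it assumes the key appears in the downloaded file name followed by an underscore, which may not match the exact names the site produces.

```python
import os

# Key that the download adds to each file name (taken from the text above).
KEY = 'datasets_None_0b3a301a-be2e-4f21-8be9-dfa5c56439c4'

def original_name(filename, key=KEY):
    """Return the file name with the key removed.

    Assumption: the key is followed by an underscore, which is stripped too.
    """
    return filename.replace(key, '').lstrip('_')

def restore_names(folder, key=KEY):
    """Rename every file in `folder` that carries the key; return the new names."""
    renamed = []
    for name in sorted(os.listdir(folder)):
        new_name = original_name(name, key)
        # Skip files that do not carry the key or that would end up nameless.
        if new_name and new_name != name:
            os.rename(os.path.join(folder, name),
                      os.path.join(folder, new_name))
            renamed.append(new_name)
    return renamed
```

Calling `restore_names('data/')` after copying the downloads into the folder would then leave files that do not carry the key untouched.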
In [1]:
from __future__ import division, print_function
import pandas as pd
import numpy as np
In [2]:
data_dir = 'data/'
In [3]:
# Training data: reviews and labels (10k rows)
train = pd.read_csv(data_dir + "train.data")#.drop('id',axis =1 )
# Validation data: reviews only, no labels (5k rows)
valid = pd.read_csv(data_dir + "valid.data")#.drop('id',axis =1 )
# Test data used for the final submission
test = pd.read_csv(data_dir + "test.data")#.drop('id',axis =1 )
In [4]:
print("train size", len(train))
print("public test size", len(valid))
print("private test size",len(test))
In [5]:
# creating arrays from pandas dataframe
X_train = train['review'].values
y_train = train['label'].values
X_valid = valid['review'].values
X_test = test['review'].values
print("raw text : \n", X_train[0])
print("label :", y_train[0])
In [6]:
print(len(X_test))
Training and testing the model with cross-validation.
In [7]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
# creating random forest classifier
rfst = RandomForestClassifier(n_estimators = 100)
# TF-IDF vectorizer with English stop words removed and at most 30000 features
myTfidfVect = TfidfVectorizer(stop_words='english', max_features=30000)
X_train_transformed = myTfidfVect.fit_transform(X_train)
The next cell may take some time.
In [8]:
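# sklearn.cross_validation was removed from scikit-learn; use model_selection
from sklearn.model_selection import cross_val_score
scores = cross_val_score(rfst, X_train_transformed, y_train,
                         scoring='accuracy', cv=5)
# scores are fractions in [0, 1]; scale them to match the '%' in the output
print('accuracy : %.2f %% +/- %.2f %%' % (100 * np.mean(scores), 100 * np.std(scores)))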
Training the model on the complete training dataset.
In [9]:
rfst.fit(X_train_transformed, y_train)
print('Model trained.')
Get the predictions.
In [10]:
X_valid_transformed = myTfidfVect.transform(X_valid)
X_test_transformed = myTfidfVect.transform(X_test)
In [11]:
prediction_valid = rfst.predict(X_valid_transformed)
prediction_test = rfst.predict(X_test_transformed)
In [12]:
pd.DataFrame(prediction_valid[:5], columns=['prediction'])
Save the results.
In [13]:
import os
if not os.path.isdir(os.path.join(os.getcwd(),'results')):
    os.mkdir(os.path.join(os.getcwd(),'results'))
np.savetxt('results/valid.predict', prediction_valid, fmt='%d')
np.savetxt('results/test.predict', prediction_test, fmt='%d')
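The expected submission format (one predicted class per line, each either 0 or 1, with as many lines as the corresponding dataset) can be checked before uploading. The sketch below is only an illustration: `check_prediction_file` is a hypothetical helper, not part of the toolkit, and it cannot verify that the original order was kept.

```python
def check_prediction_file(path, expected_rows):
    """Return the labels stored in `path`, or raise ValueError on a bad format."""
    with open(path) as handle:
        # Ignore blank lines such as a trailing newline at the end of the file.
        lines = [line.strip() for line in handle if line.strip()]
    bad = [line for line in lines if line not in ('0', '1')]
    if bad:
        raise ValueError('unexpected label: %r' % bad[0])
    if len(lines) != expected_rows:
        raise ValueError('expected %d rows, found %d'
                         % (expected_rows, len(lines)))
    return [int(line) for line in lines]
```

With the variables defined earlier, `check_prediction_file('results/valid.predict', len(valid))` and `check_prediction_file('results/test.predict', len(test))` would cover both files.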
The last operation is to zip the results. Zip only the 'valid.predict' and 'test.predict' files, not the results directory!
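One way to build the archive without including the directory is Python's standard `zipfile` module, storing each file under its base name. This is a minimal sketch: `zip_predictions` is a hypothetical helper and the archive name used in the note below is an assumption, since the text does not say what the archive should be called.

```python
import os
import zipfile

def zip_predictions(paths, archive_path):
    """Write the given files at the root of a zip archive, without their folders."""
    with zipfile.ZipFile(archive_path, 'w', zipfile.ZIP_DEFLATED) as archive:
        for path in paths:
            # arcname drops the 'results/' part so only the file name is stored.
            archive.write(path, arcname=os.path.basename(path))
    return archive_path
```

For example, `zip_predictions(['results/valid.predict', 'results/test.predict'], 'submission.zip')` would produce an archive whose only entries are `valid.predict` and `test.predict`.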